Outline

  1. The WWW is not the Internet
    • A Brief History of the Internet
    • Birth of the WWW
    • Internet, HTML and Browsers
  2. The WWW, Back & Front
    • How is the web written?
    • The structure of a simple webpage
    • The WWW millefeuille

  3. How-to in R
    • Finding your way around
    • Basic instructions
  4. A (quick) note on APIs
  5. Headless browsers

The WWW is not the Internet

A Brief History of the Internet

  • Competing Histories
  • A “Cold War Technology”
    • DARPA and the race for the future of technology
    • DARPA & ARPANET (1958-1969)

A Brief History of the Internet

  • Competing Histories
    • “The Hippies did it”: Libertarian origins
      • Augmenting individuals: D. Engelbart against technoscience
      • Collective collaboration and the rise of hackers: science (and connected computers) for the people.

A Brief History of the Internet

  • Technological & Political Struggles
    • From circuit switching to packet switching
    • A flurry of networks → TCP/IP (1978-1983)
    • Whose Technology?
    • 1980s: NSF grant to connect US universities

Birth of the WWW

  • A network that uses the internet
    • Tim Berners-Lee (1989)
  • A decentralized sorting system
    • Child of the previous evolution
    • The hyperlink at its core
  • HTML & HTTP
    • HTML: HyperText Markup Language
    • HTTP: HyperText Transfer Protocol

The Birth of the WWW

  • The Internet as a Product
    • IMDb 1990; Yahoo 1994; Amazon, eBay, Craigslist 1995;
    • Hotmail 1996; Google, PayPal 1998; Wikipedia 2001; etc.

Internet, HTML & Browsers

  • Browsers were essential to popularizing the WWW

Internet, HTML & Browsers

  • Browsers send, receive and interpret code
    • The web relies on the circulation of text written in HTML
    • The WWW is based on communication between computers via a protocol called HTTP
    • Pages (and the computers that serve them) are identified by an address, called a URL (Uniform Resource Locator)
    • HTML files are transferred and then rendered by the browser into a legible format.

Internet, HTML & Browsers

  • Browsers visualize code

The WWW, Back & Front

How is the web written?

  • The Web is Premised on HTML
    • A “Rich Language”

    • A Structuring Principle: tags

      • HTML works with tags
        • The displayed text is wrapped in tags, which carry extra information about it.
      • Ex. This is very interesting

How is the web written?

  • The Web is Premised on HTML
    • A “Rich Language” (more than meets the eye)
    • A Structuring Principle: tags
    • Tags have a type and attributes
      • Types are fixed (a, span, li, div). Each type accepts a limited set of attributes.
      • For more, see this page.
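
To make the type/attribute distinction concrete, here is a minimal sketch in R. It assumes the rvest package is installed. The HTML snippet and the example.org address are made up for illustration and are not taken from the slides.

```r
library(rvest)

# A tiny, made-up snippet: a paragraph that contains one link.
html <- '<p class="intro">This is <a href="https://example.org">very interesting</a></p>'
page <- read_html(html)

# Pick the first <a> tag in the document.
link <- html_element(page, "a")

tag_type <- html_name(link)          # the tag type
tag_href <- html_attr(link, "href")  # the value of one attribute
tag_text <- html_text2(link)         # the text displayed on screen
```

The type tells you what kind of element you are looking at, the attribute carries the extra information (here, where the link points), and the text is what a reader sees.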

How is the web written?

  • Head & Body

How is the web written?

  • A few common tags
    • <div> : generic block of content
    • <p> : paragraph
    • <a> : hypertext link
    • <h1> : (resp. h2, h3, h4, h5, h6) headings
    • <!-- … --> : comment

The structure of a simple webpage

  • An HTML File has a Tree Structure
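
The tree structure can be explored directly from R. This is a small sketch assuming rvest (and xml2, which it depends on) is installed; the page below is invented for the example.

```r
library(rvest)

# A made-up page with the usual head/body split.
html <- "<html><head><title>Demo</title></head><body><h1>Title</h1><p>Text</p></body></html>"
page <- read_html(html)

# Children of the root <html> node.
root_children <- html_name(html_children(page))

# Children of <body>, one level further down the tree.
body_children <- html_name(html_children(html_element(page, "body")))

# Print an indented overview of the whole tree.
xml2::xml_structure(page)
```

Each call to html_children() moves one level down, which is the same navigation the browser inspector offers when you expand a node.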

The structure of a simple webpage

  • What happens in the code is visible on the page

The structure of a simple webpage

  • Tools in your browser help you “inspect elements”

The WWW millefeuille

  • A webpage is built on HTML

    • And it includes other types of files:
      • Pictures
      • Cascading Style Sheets (CSS)
      • JavaScript

→ Increasingly, the web has become a millefeuille

This has consequences for scraping. But keep in mind that regularity on the screen means regularity in the code. We are going to use this.
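
A short sketch of how that regularity can be exploited, assuming rvest is installed. The list of talks, the class names (talk, speaker, title) and the names in it are all hypothetical.

```r
library(rvest)

# A made-up page where every item follows the same pattern.
html <- '
<ul>
  <li class="talk"><span class="speaker">Ada</span> <span class="title">Networks</span></li>
  <li class="talk"><span class="speaker">Grace</span> <span class="title">Compilers</span></li>
  <li class="talk"><span class="speaker">Tim</span> <span class="title">Hypertext</span></li>
</ul>'
page <- read_html(html)

# One CSS selector captures every repeated block...
talks <- html_elements(page, "li.talk")

# ...and the same sub-selectors work inside each block.
schedule <- data.frame(
  speaker = html_text2(html_element(talks, ".speaker")),
  title   = html_text2(html_element(talks, ".title"))
)
```

Because the three items look alike on screen, they share the same tags and classes, so a single selector turns them into one row each.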

How-to in R

Finding your way around

There is a wealth of dedicated libraries

This page maintains a list of everything available at the moment (and it’s plenty)

Scraping: httr or rvest

Selecting in HTML (or XML): XML or rvest

Selecting in JSON: rjson, RJSONIO, jsonlite…
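
For the JSON side, here is a minimal sketch with jsonlite (assumed installed). The JSON string is invented; real responses will have their own field names.

```r
library(jsonlite)

# A made-up JSON document, as a web service might return it.
json <- '{"event": "Summer school",
          "sessions": [{"day": 1, "topic": "Scraping"},
                       {"day": 2, "topic": "APIs"}]}'

parsed <- fromJSON(json)

event_name <- parsed$event     # a plain character value
sessions   <- parsed$sessions  # an array of objects becomes a data frame
```

fromJSON() simplifies arrays of similar objects into data frames by default, which is usually what you want before analysis.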

Basic instructions

read_html() will read the page and transform it into an XML document.

Thus,

read_html("https://sicss.io/2022/paris/schedule") will return the source code of the schedule page for the SICSS-Paris program, parsed and ready to query.
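
A minimal sketch of read_html() in action, assuming rvest is installed. To keep it runnable offline, it parses a made-up HTML string; the live call to the URL is left as a comment because it needs an internet connection.

```r
library(rvest)

# read_html() accepts a URL, a file path, or literal HTML.
# Here: a made-up page, so the example runs without a connection.
page_source <- "<html><body><h1>Schedule</h1><p>Day 1: scraping</p></body></html>"
page <- read_html(page_source)

# The result is an XML document that can be queried.
heading <- html_text2(html_element(page, "h1"))

# With a connection, the call has the same shape:
# page <- read_html("https://sicss.io/2022/paris/schedule")
```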

Basic instructions

Yes, in ~70% of cases, all you need to do to scrape a page is call read_html("PAGE")

Basic instructions

It may not be enough, if the website sees the evil crawler in you.

You will need to dress up like an honest browser

user_agent("Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US)")

user_agent("roger.rabbit@gmail.com")
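
One way to put the user agent to work is sketched below, assuming the httr and rvest packages are installed. The helper name fetch_page is hypothetical. The function is only defined, not called, since calling it requires a live site.

```r
library(httr)

# Build the user-agent setting once; it can be reused across requests.
ua <- user_agent("Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US)")

# Hypothetical helper: download a page with that user agent, then parse it.
fetch_page <- function(url, agent = ua) {
  response <- GET(url, agent)
  stop_for_status(response)  # fail loudly on 4xx/5xx answers
  rvest::read_html(content(response, as = "text", encoding = "UTF-8"))
}

# Usage, with a connection:
# page <- fetch_page("https://sicss.io/2022/paris/schedule")
```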

Basic instructions

Sometimes that won’t be enough, because you’ll need cookies.

With rvest, you will have to create a session, which stores those cookies and lets you navigate from there.

And then you’ll use read_html
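
A sketch of the session workflow, assuming rvest 1.0 or later and httr are installed. The helper name read_with_session and its arguments are hypothetical. It is defined but not called here because it needs a live website.

```r
library(rvest)

# Hypothetical helper: open a session on a first page (where the site
# sets its cookies), move on to the page of interest, and parse it.
read_with_session <- function(start_url, target_url,
                              agent = "roger.rabbit@gmail.com") {
  s <- session(start_url, httr::user_agent(agent))  # cookies are kept in `s`
  s <- session_jump_to(s, target_url)               # sent again on this request
  read_html(s)
}

# Usage, with a connection (both URLs are placeholders):
# page <- read_with_session("https://example.org", "https://example.org/data")
```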

Basic instructions

Sometimes that won’t be enough, because you’ll need to log in.

You’ll use a function called html_form() to fill in and submit the login form.

And then you’ll use read_html
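
The form step can be sketched offline, assuming rvest 1.0 or later. The login form, its field names (username, password), the credentials and the example.org address are all invented. A real site will name its fields differently, so inspect the form first.

```r
library(rvest)

# A made-up login page.
login_html <- '
<form action="/login" method="post">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="submit" value="Log in">
</form>'
page <- read_html(login_html)

# Extract the first form on the page and fill in its fields.
form   <- html_form(page, base_url = "https://example.org")[[1]]
filled <- html_form_set(form,
                        username = "roger.rabbit",
                        password = "not-a-real-password")

# With a live site, the same steps run inside a session:
# s      <- session("https://example.org/login")
# form   <- html_form(s)[[1]]
# filled <- html_form_set(form, username = "...", password = "...")
# s      <- session_submit(s, filled)
# page   <- read_html(s)
```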

Basic instructions

Once you have done that:

Great news: we are back to square 1!

(You could have copied and pasted the source code in your console, couldn’t you?)

Except that now, it is in R, and you can use it

A (quick) note on APIs

APIs

  • Application Programming Interface
    • A feature of Web 2.0
    • Legal, and often easier
    • Different forms: login, with rate limits, or just open

APIs

Most APIs require registration

See the now overused Twitter API for academics

  1. Register for a project
  2. Wait for approval
  3. Get your credentials
  4. Make requests
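
Step 4 might look like the sketch below, assuming httr and jsonlite are installed. Everything specific here is a placeholder: the api.example.org endpoint, the query parameters, the MY_API_TOKEN environment variable and the helper name search_api. Many APIs use a bearer token in this way, but each documents its own scheme.

```r
library(httr)

# Credentials from step 3, read from an environment variable
# (hypothetical name) rather than written into the script.
api_token <- Sys.getenv("MY_API_TOKEN", unset = "replace-me")

# Build the request URL from a base endpoint and query parameters.
request_url <- modify_url("https://api.example.org/v1/search",
                          query = list(q = "scraping", max_results = 10))

# Many APIs expect the token in an Authorization header.
auth <- add_headers(Authorization = paste("Bearer", api_token))

# Hypothetical helper: send the request and parse the JSON answer.
search_api <- function(url = request_url, headers = auth) {
  response <- GET(url, headers)
  stop_for_status(response)
  jsonlite::fromJSON(content(response, as = "text", encoding = "UTF-8"))
}
```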

See Chris Bail’s detailed page on APIs

See academicTwitteR webpage

APIs

Back to legal and deontological matters

  • If there is an API, use it
  • If there is no API, ask yourself: are you doing something illegal?

Sure, scraping Twitter is legal, and its content is public. But what do you learn about individuals? And how should you protect them?

Headless Browsers

A quick note on headless browsers

A growing trend in the web industry is to have websites that respond to your behavior (scrolling, clicking, etc).

A quick note on headless browsers

For behavior-based websites

A quick note on headless browsers

To do this, you will need a “headless browser”, i.e. a browser with no visible window, piloted from your code or command line, that runs the page’s JavaScript for you.

This is slightly more complex as you need to install other software, but we’ll see an example later.

This is also an easy way to avoid some classic headaches.

A quick note on headless browsers

In R, this is often done using “Selenium”. To do so, install “RSelenium”

For me, it worked better installing Docker too (…)

And you will need to type, in the command line, a few lines of code. See this explanation by Chris Bail.
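
As a rough sketch of what this can look like, assuming RSelenium and rvest are installed and a Selenium server is already running. The Docker command in the comment, the port 4445 and the helper name read_rendered_html are assumptions for illustration; your setup may differ. The function is only defined here, since running it needs that server.

```r
# Assumption: a Selenium server is reachable, for example started with Docker:
#   docker run -d -p 4445:4444 selenium/standalone-firefox

# Hypothetical helper: let a remote browser render the page (JavaScript
# included), then hand the resulting HTML to rvest.
read_rendered_html <- function(url, host = "localhost", port = 4445L, wait = 2) {
  driver <- RSelenium::remoteDriver(
    remoteServerAddr = host,
    port = port,
    browserName = "firefox"
  )
  driver$open(silent = TRUE)
  on.exit(driver$close(), add = TRUE)  # always close the browser session

  driver$navigate(url)
  Sys.sleep(wait)                      # crude pause while scripts run

  rendered <- driver$getPageSource()[[1]]
  rvest::read_html(rendered)
}
```

Once the rendered source is back in R, everything works as with a page read directly by read_html().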

Conclusion

But all you need to know is read_html() (and where to look)